ABSTRACT

A framework, or classification scheme, is provided for displaying the
spectrum of criterion-referenced tests. This framework illustrates that
no single type of test can be identified as the absolute prototype
criterion-referenced test. It is shown that over the past 115 years
criterion-referenced testing has grown to be a many-faceted concept, its
multitude of specific instances differing qualitatively from each other.
A definition is provided for a criterion-referenced test as one that is
deliberately constructed to yield measurements that are directly
interpretable in terms of specified performance standards. Three broad
categories are identified (well-defined domains, ill-defined domains, or
those that are basically undefined), and the relationships are clarified.
The classification scheme for distinguishing the varieties of these tests
is presented in tabular form according to the basis for test development.
Other tables give examples of criterion-referenced tests, and summarize
additional breakdown of categories of tests which reference scores to
unordered domains. Essentially, this framework is seen as a first step
toward clarity of communication among professionals and toward improved
test development. (Author/GSK)
Individual Differences Between
Criterion-Referenced Tests

In 1960, the term criterion-referenced testing was unknown to educators,
psychologists, and measurement specialists. By 1978, however, over 600
[illegible] (Hambleton, Swaminathan, Algina, and Coulson, 1978) and the
term was being used by many school people and most educational testing
specialists. However, as can be testified by many who have used and worked
with this type of testing, there exists considerable confusion as to what
is meant by criterion-referencing. William Gray (1978), for example, did a
content analysis of 57 descriptions of criterion-referencing written by
nearly 40 authors and found that not only do different writers use the
term differently, the same writers are sometimes inconsistent within the
same article. This clearly demonstrates that there is no single
agreed-upon definition of criterion-referencing.

The purpose of this paper is to provide a framework or classification
scheme that can be used to display the spectrum of criterion-referenced
tests. This framework illustrates that no single type of test can be
identified as the prototype criterion-referenced test. It illustrates that
over the years criterion-referenced testing has emerged as a many-faceted
concept, having a multitude of specific instances that differ
qualitatively from each other.

One reason for considering such a taxonomic classification of the area
is that it is one step on the road to systemization. Systematizing a body
of knowledge can lead to advancing the work in an area by pointing out what
has been done and what yet remains to be done. Also, it permits the
similarities and differences among the works of various investigators to be
displayed even though their rhetorics have long since become amalgamated
in common usage.

Basic Distinctions 

A broad definition can be used to distinguish criterion-referenced tests
from others (Glaser and Nitko, 1971, p. 653):

    A criterion-referenced test is one that is deliberately
    constructed to yield measurements that are directly
    interpretable in terms of specified performance standards.
    Performance standards are generally specified by defining
    a class or domain of tasks that should be performed
    by the individual. Measurements are taken on representative
    samples of tasks drawn from this domain, and such measures
    are referenced directly to this domain for each individual
    measured. [Criterion-referenced tests] are specifically
    constructed to support generalizations about an individual's
    performance relative to a specified domain of tasks
    .... The term task includes both content and process.

Many types of achievement tests seem to fit this broad definition. The
first step taken to distinguish among them is to characterize the manner in
which each defines the domain of behaviors to which an examinee's test
performance is to be referenced. Three broad categories can be identified:
Domains are either well-defined, ill-defined, or basically undefined. Tests
falling into the latter two categories do not qualify as criterion-referenced
tests under the broad definition adopted here, even though the test
developers may claim otherwise.

Examples of tests based on ill-defined domains include:

(1) Tests developed from "behavioral objectives"
    that are so poorly written and ambiguous
    that it is not possible to know to which
    domains of behavior a test score can be
    referenced, or

(2) Tests developed in such a way that the domain
    is defined only in terms of the particular
    items on the test, so that the broader
    generalizations, which are required for
    decision-making, cannot be made.

Tests based on ill-defined domains have been called "cloud-referenced tests"
by Popham (1974).

Some tests referred to as criterion-referenced simply do not define a
domain of behavior and, thus, such tests cannot form the basis for
referencing test performance in the manner considered here. Frequently,
when you encounter such tests, you will notice that the test developer has
confused the idea of a cut-off score with the idea of criterion-referencing
an examinee's score to a domain of instructionally relevant performance.
These kinds of tests might be called pseudo-referenced tests and represent
a misapplication of the idea of criterion-referencing as described here.

A domain is well-defined when it is clear to both the test developer
and the test user which categories of performance (or which kinds of tasks)
are and are not to be considered as potential test items. Well-defined
domains are a necessary condition for criterion-referencing since the basic
idea is to generalize how well a student can perform in a broader class of
behaviors, only a few of which happen to appear on a particular test form.
Since test development includes much more than defining a domain, the
condition of domain definition, while necessary, is not sufficient.



Insert Table 1 About Here



The Classification Scheme 

Table 1 shows the scheme for classifying various kinds of
criterion-referenced tests. Notice that the column headings identify the
ways to characterize domains on which tests are built. In the scheme,
well-defined domains are further sub-divided into ordered and unordered
domains. This distinction is a fundamental one and is based on a conception
that in some cases the behaviors in a domain can be ordered along a
continuum of achievement such as that of which Glaser and Klaus (1962;
Glaser, 1963) and others have spoken. For example, one can think of
ordering behavior in a sequence of learning prerequisites so that a test
can be built which references a student's performance to this sequence. One
use of such a test would be to provide information about what has been
already learned and thus permit the planning of the next stage of
instruction. Five different bases for ordering are listed in the table and
perhaps there are more. These different bases will be described and
illustrated shortly.

The scheme illustrated in Table 1 shows four broad categories for 
classifying criterion-referenced tests that are developed from well-defined 
but unordered domains. Perhaps other ways not included in the table could 
be listed, too. 

Before describing and illustrating the various kinds of
criterion-referenced tests that fall into each of these categories, it
should be noted that the basis used to categorize them was not just the
original authors' or test developers' definitions, but a broader
consideration of (a) the tests themselves, (b) the manner in which they
were produced, and (c) the overall context of the authors' discussions of
them. As a result of this, you will notice that there are included several
tests or suggestions which have not been identified previously as
criterion-referenced. Some of these existed before Glaser and Klaus
invented the term. Other tests and procedures are identified as
criterion-referenced, even though their authors explicitly deny that they
are, if they appear to satisfy the broad definition adopted. Frequently,
authors deny their association with criterion-referencing because they
disagree with one particular definition, interpretation, or application,
but they fail to be specific about the nuance of the concept with which
they disagree. One possible outcome of the present treatment is to display
the basis for many of these disagreements and to provide a framework into
which a wide range of suggestions for test improvement may have pluriaxial
existence. Still another outcome of this process is to identify
distinctions between various types of criterion-referencing that the
original authors themselves have either not recognized or have not made
explicit.
Table 1. A scheme for classifying and distinguishing the many varieties of
tests that have been called criterion-referenced

How the domain of behavior for achievement testing is characterized:

Well-defined and ordered domains

    Ordering based on judgements of the social or aesthetic quality of an
    examinee's product or performance.

    Ordering based on the level of difficulty or complexity to which a
    topic or subject is learned.

    Ordering based on degree of proficiency with which a complex skill is
    performed.

    Ordering based on prerequisite sequences for acquiring an intellectual
    or psychomotor skill.

    Ordering based on an empirically defined latent trait.

    Ordering on other bases is possible.

Well-defined but unordered domains

    Specifying the stimulus properties of the items to be included in the
    domain.

    Specifying the stimuli and the responses in the domain.

    Specifying the "diagnostic" categories of the domain.

    Specifying the abstractions, traits, or constructs that define the
    domain.

    Other ways of specifying the domain are possible.

Ill-defined domains

    Poorly articulated behavioral objectives.

    Defining the domain only in terms of the particular items on the test.

Undefined domains

    No attempt to define a domain to which to reference test performance.

    Using a cut-off score but not defining a performance domain.





Table 2

Various categories of criterion-referenced
tests based on well-defined and ordered domains

Basis for scaling or ordering       Examples (b)
the defined domain of behavior (a)
----------------------------------  ------------------------------------------
Judged social or aesthetic quality  Rev. George Fisher's Scale Books (1864)
of the performance                  E. L. Thorndike's Handwriting (1909) and
                                      Drawing (1913) Scales

Complexity or difficulty level of   Ayres's Spelling Scale (1915)
the subject-matter                  Glaser's Criterion-Referenced Measures I
                                      (1962, 1963)
                                    Cox and Graham's Arithmetic Scale (1966)

Degree of proficiency with which    Harvard-Newton English Composition Scales
complex skills are performed          (1914)
                                    Glaser's Criterion-Referenced Measures II
                                      (1962, 1963)
                                    Perhaps certain sports events or physical
                                      fitness tests

Prerequisite sequence for           Gagné's Learning Hierarchies (1962)
acquiring intellectual and          Piagetian Development Scales (Gray, 1978)
psychomotor skills                  Infant Development Scales (Uzgiris & Hunt,
                                      1966)

Location on an empirically defined  Connolly, Nachtman, and Pritchett
latent trait                          arithmetic tests (1971)
                                    Other tests built with latent trait models
                                      (e.g., Rasch, 1960 or Birnbaum, 1968)
                                      provided they are referenced to
                                      well-defined and ordered domains of
                                      behavior

(a) Other bases for scaling are possible as well.
(b) Examples are meant to be illustrative rather than representative or
    exhaustive.



Table 2 shows and gives references to examples of criterion-referenced
tests that are based on well-defined and ordered domains. Notice that the
examples span roughly 100 years, from Rev. George Fisher's 1864 Scale
Books to some current conceptions of criterion-referencing that use one of
the latent trait theory models.

Among the distinctions made is the one between the two types of
orderings advocated by Glaser (1963; Glaser and Klaus, 1962). One ordering
is based on subject-matter difficulty or complexity: an examinee's score
is scaled to reveal to which level of difficulty or to which level of
complexity a topic or subject has been learned. Figure 1 shows an example of
this based on a simple addition scale developed by Cox and Graham (1966)
to identify the most complex type of problem a child could perform. (The
figure also illustrates the idea that norm-referencing is not incompatible
with criterion-referencing.)
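
Referencing a score to an ordered domain can be sketched in a few lines of
code. The following is only an illustration and is not Cox and Graham's
actual scale: the level names, the item results, and the 0.8 mastery
threshold are all assumptions introduced here. The function reports the
most complex level an examinee has mastered, counting a level only when
every simpler level has also been mastered.

```python
from typing import Optional

# Hypothetical levels of an addition domain, ordered from least to most
# complex. These names are illustrative, not taken from Cox and Graham (1966).
LEVELS = [
    "single-digit sums",
    "two-digit sums without carrying",
    "two-digit sums with carrying",
    "three-digit sums with carrying",
]


def highest_level_mastered(results: dict[str, list[bool]],
                           threshold: float = 0.8) -> Optional[str]:
    """Return the most complex level mastered, or None if none is mastered.

    A level counts as mastered when the proportion of correct items is at
    least `threshold` (an assumed cut-off). Scanning stops at the first
    level that is not mastered, so the result respects the ordering.
    """
    mastered = None
    for level in LEVELS:
        items = results.get(level, [])
        if not items or sum(items) / len(items) < threshold:
            break
        mastered = level
    return mastered


# Made-up item results for one examinee (True = correct).
examinee = {
    "single-digit sums": [True, True, True, True, True],
    "two-digit sums without carrying": [True, True, True, True, False],
    "two-digit sums with carrying": [True, False, False, True, False],
    "three-digit sums with carrying": [False, False, False, False, False],
}
print(highest_level_mastered(examinee))
```

With these made-up results the examinee is placed at "two-digit sums
without carrying": the score is interpreted as a location in the ordered
domain rather than as a count of items correct.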

A second type of ordering advocated by Glaser is based on the degree
of proficiency with which complex intellectual or psychomotor skills can
be performed. A summary of the various ways proficient experts perform
differently from novices has been provided by Chi and Glaser (in press).
Table 3 summarizes an additional break-down of the categories of tests
which reference scores to unordered domains. There are four broad categories
which distinguish the tests according to both the manner and specificity with
which the domains are defined. Within each category there are certain nuances.
These within-category nuances have in common essentially the same basis for
defining and delineating a domain, but each emphasizes a somewhat different
perspective or aspect. The time constraints of this presentation do not
permit a full discussion of each category, but some comments can be made.

With regard to the first category, stimulus properties and sampling
plans, it should be noted that for purposes of test development, it is
necessary to use intuition or a "theory of performance" to specify those
stimulus properties that would likely cause behavior to vary and, hence,
that ought to be taken into account when sampling from the domain. Thus,
while focus is on stimulus characteristics, response characteristics are not
neglected. When a theory of performance is crude or undeveloped,
stratification and sampling follow suit.
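
A small sketch may make the link between stimulus properties and sampling
concrete. It assumes a hypothetical domain of two-addend addition items
whose strata are defined by two stimulus properties chosen here purely for
illustration (number of digits, and whether carrying is required); the
function names are likewise invented for this example. Items are drawn by
stratum so that each combination of properties is represented on the test
form.

```python
import random
from itertools import product

# Assumed stimulus properties for a hypothetical domain of addition items.
DIGITS = [1, 2, 3]          # number of digits in each addend
CARRYING = [False, True]    # whether the sum requires carrying


def has_carry(a: int, b: int) -> bool:
    """Return True if adding a and b column by column produces a carry."""
    while a and b:
        if a % 10 + b % 10 >= 10:
            return True
        a, b = a // 10, b // 10
    return False


def make_item(digits: int, carrying: bool, rng: random.Random) -> tuple[int, int]:
    """Draw one addition item (a, b) with the requested stimulus properties."""
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    while True:  # rejection sampling: redraw until the carry property matches
        a, b = rng.randint(low, high), rng.randint(low, high)
        if has_carry(a, b) == carrying:
            return a, b


def stratified_sample(items_per_stratum: int, seed: int = 0) -> dict:
    """Sample the same number of items from every stratum of the domain."""
    rng = random.Random(seed)
    return {
        (digits, carrying): [
            make_item(digits, carrying, rng) for _ in range(items_per_stratum)
        ]
        for digits, carrying in product(DIGITS, CARRYING)
    }


test_form = stratified_sample(items_per_stratum=2)
for stratum, items in test_form.items():
    print(stratum, items)
```

If the theory of performance were cruder (for example, if carrying were not
recognized as a property that affects difficulty), the strata would
collapse and the sample would represent the domain less precisely.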

It should be noted, too, that Ebel's (1962) content-standard scores are
placed in this first category. Although Glaser (1963) has pointed to the
similarity between his proposal and Ebel's content-standard scores, Ebel
has argued against criterion-referenced tests while still implying the
usefulness of content-standard scores (Ebel, 1970; 1971; 1978). It is not
clear whether this debate centers on the whole of criterion-referenced
testing or only on certain varieties of these tests.

In the psychometric literature such ordering of performance is often
referred to as scaling. Scaling implies establishing a metric as well as
determining ordinality (cf. Angoff, 1971).

Table 3

Various categories of criterion-referenced tests
based on well-defined but unordered domains

Basis for delineating     During test development       Examples (b)
the behavior domain (a)   emphasis is placed on:
------------------------  ----------------------------  --------------------------------------
Stimulus Properties of    Defining content and          Starch's English Vocabulary Test (1916)
the Domain and the        content strata                Ebel's Content-standard English
Sampling Plan of the                                      Vocabulary Test (1962)
Test

                          Specifying stimulus           Hively's Item Forms (1966, 1968)
                          properties of item domains    Osburn's Item Forms (1968)

                          Specifying the precise        Bormuth's Transformational Rules (1970)
                          relationship between
                          instructional content and
                          item domain

Verbal Statements of      Behavioral objectives with    Tests based on Mager's Type of
Stimuli and Responses in  or without the cut-off          Objectives (1962)
Domain                    score ("criterion")           Curriculum Embedded Tests of IPI
                          specified                       Mathematics (1967)
                                                        Popham and Husek's Criterion-Referenced
                                                          Testing (1969)
                                                        Harris and Stewart's Criterion-Referenced
                                                          Testing (1971)

                          Elaborated descriptions of    Popham's Criterion-Referenced Tests
                          behaviors and stimuli           (1973, 1978)
                                                        IOX Objectives-Based Tests (1972):
                                                          Amplified Objectives
                                                        IOX Test Specifications (1978)

"Diagnostic" Categories   Identifying entry level       Hunt and Kirk Tests of School
of Performance            behaviors                       Readiness (1974)

                          Identifying missing           Tests built on Resnick's Component
                          component behaviors             Analysis (1973)
                                                        Gagné's Two Stage Testing (1970)

                          Identifying and               "Tab-Item" Technique (1954)
                          categorizing erroneous        Nesbit's CHILD Program (1966)
                          responses                     Hsu's Computer-Assisted Diagnostic
                                                          Tests (1972)

                          Identifying erroneous         Beck's Blending Algorithm (1972)
                          processes                     Interviews to determine what
                                                          processes were used in responding

Abstractions, Traits or   Specifying specific           Tests based on the Taxonomy of
Constructs                behaviors or categories of      Educational Objectives (1956)
                          behaviors that delimit the    Certain basic skills survey tests,
                          abstraction, trait, or          e.g., ITBS, MAT
                          construct

(a) Other bases for delineating exist.
(b) Examples are meant to be illustrative rather than representative or
    exhaustive.
It is the second category, delineating a domain by specifying stimuli
and responses, that has received the most publicity, professional attention,
and practical work. Further, most discussions of criterion-referenced testing
have assumed that it is necessary to use some variety of behavioral
objectives in order to develop a criterion-referenced test.

As is indicated by the third category, "diagnosis" has taken a variety
of forms including identifying such aspects as (a) entry level behaviors,
(b) missing component behaviors, (c) erroneous responses, and (d) erroneous
processes.

Perhaps the most controversial category is the fourth. Tests in this
category specify the domains in terms of abstractions, traits, or constructs
and frequently use fine-grained behavioral objectives as well. The categories
of the Bloom et al. (1956) Taxonomy, for example, refer mainly to internal
processes or psychological constructs (e.g., see Cronbach, 1971). Reading
comprehension or spelling ability are other examples of constructs or traits.

It may be thought that these tests are really "cloud-referenced," and
indeed there are some tests for which this seems true. But the distinction
here is that if the tests do have reasonably well-defined domains of
instructionally relevant behaviors, they fall within the scope of the broad
definition of criterion-referencing adopted here. If the developers of such
tests choose to define these domains in terms of abstractions or constructs,
rather than narrow stimulus/response classifications, perhaps this may
diminish the usefulness of the tests for certain purposes in particular
instructional programs. Nevertheless, for many such tests, the descriptions
of the domains are understood by most teachers and educators and, in that
sense, can be considered to be well-defined.

Implications

Among the implications of such a classification scheme for criterion- 
referenced tests are these. 


1. It is generally recognized by the profession that there are many
poorly constructed criterion-referenced tests, and some attempt has been made
to develop sets of guidelines or standards for evaluating them (e.g.,
Hambleton and Eignor, 1978). Unless care is taken, however, such guidelines
are likely to focus on only one or two varieties of criterion-referencing.
The result may be that many useful tests are judged unfairly. Further, the
concept of criterion-referencing that is communicated through such
standards, to the profession and to users of tests, is likely to be woefully
incomplete unless a broad view is taken.

2. Traditional concepts of reliability and validity appear much more
relevant to the total field of criterion-referencing than has been previously
admitted. Looming over the whole area, for example, is the notion of construct
validation. Among the many interpretations that are advocated for different
kinds of criterion-referenced tests are such ideas as: mastery vs. non-mastery,
expert vs. novice, hierarchical learning sequence, and a host of diagnostic
categories each with specific implications. Criterion-oriented validity is
another traditional concept that appears applicable to many of the tests in
the field which lay claim to being useful for labeling or diagnosing children
or for assigning them to various qualitatively different instructional
treatments.

3. A third implication is that in order for professional work in this
area to continue it is necessary to make careful distinctions among various
types of tests and to avoid intermingling substantively different rhetorical
arguments. One reason why there are many definitions and varieties of
criterion-referenced tests is the changing nature of the concept, as well
as the way it has been shaped by various applications. This is to be
expected in any area that is growing and maturing and need not be
disconcerting if reasonable care is taken to properly reference one's
definitional source.

4. Criterion-referenced testing started as a movement to make tests
more related to the kinds of information needed for effective instructional
decisions (Glaser, 1963). One can ask which, if any, of these many types
of criterion-referenced tests are able to fulfill that original intention
and which types are compromising it. Further critical analysis is needed
to identify the potential of each type for the improvement of the learning
environment and further work should be done where necessary.

5. Two very popular definitions, Glaser's (1963; Glaser and Klaus,
1962) and Popham's (1975; Popham and Husek, 1969), appear quite different
in intention. Most of the psychometric work on criterion-referencing has
been directed toward the Popham ideas which consider primarily domains or
collections of essentially unordered behaviors or tasks. The testing problem
is seen as one of estimating an examinee's status on a domain, usually in
terms of a proportion of tasks that can be performed in the entire domain.
On the other hand, much of the relevant work by psychologists in such areas
as cognitive processes, novice/expert distinctions, and problem solving
strategies has been more in line with the Glaser concept of an ordered
domain. There appears, then, to be a continuing need for intercourse
between the three kingdoms of Educationdom, Learningvania, and Psychometrica
(Glaser, 1969) on this matter of what is important to test and how to go
about testing it. Perhaps a scheme such as the one presented here will be
a first step toward clarity of communication in this regard and in regard
to some of the other measurement issues raised.
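
The estimation problem mentioned in item 5 can be illustrated with a short
sketch. It assumes that the items on a test form are a simple random sample
from an unordered domain and that each item is scored right or wrong. The
function name and the normal-approximation interval are choices made here
for illustration; they are not procedures given in the sources cited above.

```python
import math


def estimate_domain_score(responses: list[bool], z: float = 1.96):
    """Estimate the proportion of domain tasks an examinee can perform.

    Returns (p_hat, lower, upper), where p_hat is the observed proportion
    correct and (lower, upper) is a normal-approximation interval clipped
    to [0, 1]. Assumes items were randomly sampled from the domain.
    """
    n = len(responses)
    if n == 0:
        raise ValueError("at least one item response is required")
    p_hat = sum(responses) / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # binomial standard error
    return p_hat, max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)


# Made-up data: 16 of 20 sampled items answered correctly.
p_hat, lower, upper = estimate_domain_score([True] * 16 + [False] * 4)
print(f"estimated domain score: {p_hat:.2f} ({lower:.2f} to {upper:.2f})")
```

The width of the interval is a reminder that a short test supports only a
rough generalization to the whole domain, which is one reason traditional
reliability concepts remain relevant.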
This paper provides a framework for integrating the many conceptions of
criterion-referenced testing. The categories of the scheme that is developed
are illustrated with examples of criterion-referenced tests which have been
developed over a period of one and a quarter centuries: between 1864 and
1978. The scheme illustrates how various conceptions of criterion-referenced
testing are both similar to and different from each other. This clarifies
the relationships between some well-known and popular definitions. The
framework is seen as a first step toward clarity of communication among
professionals and toward improved test development.